Abstract

In this paper, we explore worst-case solutions for the problems of single
and multiple matching on strings in the word RAM model with word
length w. In the first problem, we have to build a data structure based on
a pattern p of length m over an alphabet of size $\sigma$ such that we can answer
the following query: given a text T of length n, where each character is
encoded using $\log\sigma$ bits, return the positions of all the occurrences of p in
T (in the following, occ denotes the number of reported occurrences).
For the multi-pattern matching problem we have a set S of d patterns of
total length m, and a query on a text T consists in finding all positions of
all occurrences in T of the patterns in S. As each character of the text is
encoded using $\log\sigma$ bits and we can read w bits in constant time in the
RAM model, we assume that we can read up to $\Theta(w/\log\sigma)$ consecutive
characters of the text in one time step. This implies that the fastest
possible query time for both problems is $O(n\frac{\log\sigma}{w} + occ)$. In this paper we
present several results for both problems which come close to that
best possible query time. We first present two linear space data
structures: the first one answers single
pattern matching queries in time $O(n(\frac{1}{m} + \frac{\log\sigma}{w}) + occ)$, while the second one
answers multiple pattern matching queries in time
$O(n(\frac{\log d + \log y + \log\log m}{y} + \frac{\log\sigma}{w}) + occ)$, where y is the length of the shortest pattern. We then show
how a simple application of the Four Russians technique yields data
structures with query times independent of the length of the shortest
pattern (the length of the only pattern in the case of single string matching)
at the expense of using more space.

1 Introduction



The problems of string pattern matching and multiple string pattern matching
are classical algorithmic problems in the area of pattern matching. In the
multiple string matching problem, we have to preprocess a dictionary of d strings
of total length m characters over an alphabet of size $\sigma$ so that we can answer
the following query: given any text of length n, find all occurrences in the text
of any of the d strings. In the case of single string matching, we simply have
d = 1.

The textbook solutions for the two problems are the Knuth-Morris-Pratt [20]
(KMP for short) automaton for the single string matching problem and the
Aho-Corasick [1] automaton (AC for short) for the multiple string matching
problem. The AC automaton is actually a generalization of the KMP
automaton. Both algorithms achieve $O(n + occ)$ query time (where occ denotes the
number of reported occurrences) using $O(m \log m)$ bits of space¹ (both
automata are encoded using $O(m)$ pointers occupying $\log m$ bits each). The query
time of both algorithms is in fact optimal if the matching is restricted to read
all the characters of the text one by one. However, as noticed in many
previous works, in many cases it is actually possible to avoid reading all the
characters of the text and hence achieve a better performance. This stems from
the fact that by reading some characters at certain positions in the text, one
can conclude whether a match is possible or not without the need to read all
the characters. This has led to various algorithms with so-called sublinear query
time, assuming that the characters of the patterns and/or the text are drawn
from some random distribution. The first algorithm which exploited that fact
was the Boyer-Moore algorithm [B]. Subsequently other algorithms with provably
average-optimal performance were devised, most notably BDM and
BNDM for single string matching and multi-BDM PHQ31] and multi-BNDM
[25] for multiple string matching. Those algorithms achieve
$O(n\frac{\log m}{m\log\sigma} + occ)$
time for single string matching (which is optimal according to the lower bound
in [29]) and $O(n\frac{\log d + \log y}{y\log\sigma} + occ)$ time for multiple string matching, where y is
the length of the shortest string in the set. Still, in the worst case those
algorithms may have to read all the text characters and thus have $\Omega(n + occ)$ query
time (actually many of those algorithms have an even worse query time in the
worst case, namely $\Omega(nm + occ)$).

A general trend has appeared in the last two decades: many papers have
tried to exploit the power of the word RAM model to speed up and/or
reduce the space requirement of classical algorithms and data structures. In
this model, the computer operates on words of length w, and the usual arithmetic
and logic operations on the words all take one unit of time.
In this paper we focus on the worst-case bounds in the RAM model with word
length w. That is, we try to improve on KMP and AC in the RAM model,
assuming that we have to read all the characters of the text, which are assumed
to be stored in a contiguous area in memory using $\log\sigma$ bits per character.
That means that it is possible to read $\Theta(w/\log\sigma)$ consecutive characters of the
text in $O(1)$ time. Thus given a text of length n characters, an optimal
algorithm should spend $O(n\frac{\log\sigma}{w} + occ)$ time to report all the occurrences of matching
patterns in the text. The main result of this paper is a worst-case efficient
algorithm whose performance is essentially the addition of a term similar to the
average-optimal time presented above plus the time necessary to read all the
characters of the text in the RAM model. Unlike many other papers, we only
assume that $w = \Omega(\log(n + m))$, and not necessarily that $w = \Theta(\log(n + m))$.
That is, we only assume that a pointer to the manipulated data (the text and the
patterns) fits in a memory word, but the word length w can be arbitrarily larger
than $\log m$ or $\log n$. This assumption makes it possible to state time bounds
which are independent of m and n, implying larger speedups for small values of
m and n.

¹ In this paper we quantify the space usage in bits rather than in words as is usual in other
papers.

Fredriksson [17] presents a general approach which can be applied
to speed up many pattern matching algorithms. This approach, which is based
on the notion of super-alphabet, relies on the use of tabulation (the Four Russians
technique). If this approach is applied to our problems of single and multiple
string matching queries, given an available precomputed space t, we can get a
$\log_\sigma(t/m)$ factor speedup. Bille [5] presented a more space-efficient
method for single string matching queries which accelerates the KMP algorithm
to answer queries in time $O(\frac{n}{\log_\sigma n} + occ)$ using $O(n^\varepsilon + m \log m)$ bits of space
for any constant $\varepsilon$ such that $0 < \varepsilon < 1$. More generally, the algorithm can be
tuned to use an additional amount t of tabulation space in order to provide a
$\log_\sigma t$ factor speedup.

At the end of his paper, Bille asked two questions: the first one was whether 
it is possible to get an acceleration proportional to the machine word length w 
(instead of log n or log t) using linear space only. The second one was whether 
it is possible to obtain similar results for the multiple string matching problem. 
We give partial answers to both questions. Namely, we prove the following two 
results: 

1. Our first result states that for d strings of minimal length y, we can
construct an index which occupies linear space and answers queries in time
$O(n(\frac{\log d + \log y + \log\log m}{y} + \frac{\log\sigma}{w}) + occ)$. This result implies that we can get
a speedup factor $\frac{w}{(\log d + \log w)\log\sigma}$ if $y \ge \frac{w}{\log\sigma}$, and get the optimal speedup
factor $\frac{w}{\log\sigma}$ if $y \ge (\log d + \log w)\frac{w}{\log\sigma}$.

2. Our second result implies that for d patterns of arbitrary lengths and an
additional t bits of memory, we can obtain a factor
$\frac{\log_\sigma t}{\log d + \log\log_\sigma t + \log\log m}$
speedup using $O(m \log m + t)$ bits of memory.

Our first result compares favorably to Bille's and Fredriksson's approaches as it
does not use any additional tabulation space. In order to obtain any significant
speedup, the algorithms of Bille and Fredriksson require a substantial amount
of space t which is not guaranteed to be available. Even if such an amount
of space was available, the algorithm could run much slower in case $m \ll t$, as
modern hardware is made of memory hierarchies, where random access to large
tables which do not fit in the fast levels of the hierarchy might be much slower
than access to small data which fit in faster levels of the hierarchy.
Our second result is useful in case the shortest string is very short and thus
the first result does not provide any speedup. The result is slightly less efficient
than that of Bille for single string matching, being a factor $\log\log_\sigma t + \log\log m$
slower (compared to the $\log_\sigma t$ speedup of Bille's algorithm). However, our
second result efficiently extends to multiple string matching queries, while Bille's
algorithm does not seem to be easily extensible to multiple string matching queries.
The third and fourth results in this paper are concerned with single string
matching, where we can have solutions with a better query time than what can
be obtained by using the first and second results for matching a single pattern.
In particular our results imply the following:

1. Given a single pattern p of length m, we can construct a data structure
which occupies linear space and which can find all occ occurrences of p
in any text of length n in time $O(n(\frac{1}{m} + \frac{\log\sigma}{w}) + occ)$. This implies that
we can get the optimal query time $O(n\frac{\log\sigma}{w} + occ)$ as long as $m \ge \frac{w}{\log\sigma}$.

2. For a single string p of length m and some additional t bits of space,
we can build a data structure which occupies $O(m \log m + t)$ bits of space
such that all the occ occurrences of p in any text of length n are reported
in time $O(n/\log_\sigma t + occ)$.

In a recent work [1], we have tried to use the power of the RAM model to
improve the space used by the AC representation to the optimal (up to a constant
factor) $O(m \log\sigma)$ bits instead of the $O(m \log m)$ bits of the original representation,
while maintaining the same query time. In this paper, we attempt to do the
converse. That is, we try to use the power of the RAM model to improve the
query time of the AC automaton while using the same space as the original
representation.

We emphasize that our results are mostly theoretical in nature. The constants 
in space usage and query time of our data structures seem rather large. More- 
over, in practice average efficient algorithms which have been tuned for years 
are likely to behave much better than any worst-case efficient algorithm. For 
example, for DNA matching, it was noted that DNA sequences encountered 
in practice are rather random and hence average-efficient algorithms tend to 
perform extremely well for matching in DNA sequences (see [35] for example). 

2 Outline of the results 

2.1 Problem definition, notation and preliminaries 

In this paper, we aim at addressing two problems: the single string pattern
matching and the multiple string pattern matching problems. In the single
string pattern matching problem we have to build a data structure on a single
pattern (string) of length m over an alphabet of size $\sigma \le m$². In the multiple
string pattern matching problem, we have a set S of d patterns of total length
m characters where each character is drawn from an alphabet of size $\sigma \le m$. In
the first problem, we have to identify all occurrences of the pattern in a text T
of length n. In the second problem, we have to identify all occurrences of any
of the d patterns.

In this paper, we assume a unit-cost RAM model with word length w, and
assume that $w = \Omega(\log m + \log n)$. However, w could be arbitrarily larger than
$\log m$ or $\log n$. We assume that the patterns and the text are drawn from the
same alphabet $\Sigma$ of size $\sigma \le m$. We assume that all usual RAM operations
(multiplications, additions, divisions, shifts, etc.) take one unit of time.
For any string x we denote by $x[i, j]$ (or $x[i..j]$) the substring of x which begins
at position i and ends at position j in the string x. For any integer m we denote
by $\log m$ the integer $\lceil \log_2 m \rceil$.

In the paper we make use of two kinds of ordering on the strings: the prefix
lexicographic order, which is the standard lexicographic ordering (strings are
compared left-to-right), and the suffix-lexicographic order, which is defined in
the same way as the prefix lexicographic order, but in which strings are compared
right-to-left instead of left-to-right. The second ordering can be thought of as
writing the strings in reverse before comparing them. Unless otherwise stated,
string lengths are expressed in terms of number of characters. We make use of
the fixed integer bit concatenation operator ($\cdot$) which operates on fixed length
integers, where $z = x \cdot y$ means that z is the integer whose bit representation
consists of the concatenation of the bits of the integer x as most significant
bits followed by the bits of the integer y as least significant bits. We define the
function $sucount_X(s)$, which returns the number of elements of a set X which
have a string s as a suffix. Likewise we define the function $prcount_X(s)$, which
returns the number of elements of a set X which have a string s as a prefix. We
also define two other functions $surank_X(s)$ and $prrank_X(s)$ as the functions
which return the number of elements of a set X which precede the string s in
suffix and prefix lexicographic orders respectively.
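
The four counting functions can be illustrated with a direct, quadratic-time sketch. It is only meant to make the definitions concrete; the function names mirror the notation of the text, the sample set X is a made-up example, and "precede" is read here as "strictly precede".

```python
def prcount(X, s):
    """Number of strings of X that have s as a prefix."""
    return sum(1 for x in X if x.startswith(s))


def sucount(X, s):
    """Number of strings of X that have s as a suffix."""
    return sum(1 for x in X if x.endswith(s))


def prrank(X, s):
    """Number of strings of X that strictly precede s in prefix-lexicographic order."""
    return sum(1 for x in X if x < s)


def surank(X, s):
    """Number of strings of X that strictly precede s in suffix-lexicographic order.

    Comparing right-to-left is the same as comparing the reversed strings.
    """
    return sum(1 for x in X if x[::-1] < s[::-1])


# Hypothetical sample set used only for illustration.
X = {"ab", "abc", "b", "cab", "ca"}
```

For instance, with this X the strings ending in "ab" are "ab" and "cab", so sucount(X, "ab") is 2.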

2.2 Results 

The results of this paper are summarized by the following two theorems: 

Theorem 1 Given a set S of d strings of total length m, where the shortest
string is of length y, we can build a data structure of size $O(m \log m)$ bits such
that given any text T of length n, we can find all occurrences of strings of S in
T in time $O(n(\frac{\log d + \log y + \log\log m}{y} + \frac{\log\sigma}{w}) + occ)$.

For multiple string matching, the theorem gives us the following two interesting
corollaries:

² Our results also apply to the case $\sigma > m$. The only change is in the space bounds, in which the
term $m \log m$ should be replaced by $m \log\sigma$.



Corollary 1 Given a set S of d strings of total length m where each string is of
length at least $\frac{w}{\log\sigma}$ characters, we can build a data structure of size $O(m \log m)$
bits such that given any text T of length n, we can find all occurrences
of strings of S in T in time $O(n\frac{(\log d + \log w)\log\sigma}{w} + occ)$.

For the case of even larger minimal length, we can get optimal query time:

Corollary 2 Given a set S of d strings of total length m where each string is
of length at least $(\log d + \log w)\frac{w}{\log\sigma}$ characters, we can build a data structure
occupying $O(m \log m)$ bits of space such that given any text T of length n, we
can find all occurrences of strings of S in T in the optimal $O(n\frac{\log\sigma}{w} + occ)$ time.
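
Both corollaries follow from Theorem 1 by substituting the lower bound on y. The following is a hedged sketch of that substitution and is not taken from the paper's proofs. It assumes that $w/\log\sigma \ge 3$, so that $(c + \log y)/y$ with $c \ge 0$ is non-increasing beyond the threshold, and it uses $\log\log m \le \log w + O(1)$, which follows from $w = \Omega(\log m)$. The names $y_1$ and $y_2$ are introduced here only for the two thresholds.

```latex
% Corollary 1: y \ge y_1 = w / \log\sigma, hence \log y_1 \le \log w
\frac{\log d + \log y + \log\log m}{y}
  \le \frac{\log d + \log y_1 + \log\log m}{y_1}
  = O\!\left(\frac{(\log d + \log w)\log\sigma}{w}\right)

% Corollary 2: y \ge y_2 = (\log d + \log w)\, w / \log\sigma,
% hence \log y_2 = O(\log d + \log w)
\frac{\log d + \log y + \log\log m}{y}
  \le \frac{O(\log d + \log w)}{(\log d + \log w)\, w / \log\sigma}
  = O\!\left(\frac{\log\sigma}{w}\right)
```

In both cases the additional $\frac{\log\sigma}{w}$ term of Theorem 1 is dominated by the bound obtained, which gives the stated query times.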

The dependence of the bounds in Theorem 1 and its corollaries on the minimal
pattern length is not unusual. This dependence also exists in average-optimal
algorithms like BDM, BNDM and their multiple pattern variants [12, 10, 25].
Those algorithms achieve a $\frac{y\log\sigma}{\log d + \log y}$ speedup factor on average, requiring that
the strings are of minimal length y. Our query time is the addition of a term
which represents the time necessary to read all the characters of the text in the
RAM model and a term which is similar to the query time of the average-optimal
algorithms.

We also show a variation of the first theorem which uses the Four Russians
technique and which will mostly be useful in case the minimal length is too short:



Theorem 2 Given a set S of d strings of total length m and an integer
parameter $\alpha$, we can build a data structure occupying $O(m \log m + \sigma^\alpha \log^2\alpha \log m)$ bits
of space such that given any text T of length n, we can find all occ occurrences
of strings of S in T in time $O(n\frac{\log d + \log\alpha + \log\log m}{\alpha} + occ)$.

The theorem could be interpreted in the following way: having some additional
amount t of available memory space, we can achieve a speedup factor $\frac{\alpha}{\log d + \log\alpha + \log\log m}$
for $\alpha = \log_\sigma t$ using a data structure which occupies $O(m \log m + t)$ bits of space.
The theorem gives us two interesting corollaries which depend on the relation
between m and n. In the case where $n \ge m$, by setting $t = n^\varepsilon$ for any $0 < \varepsilon < 1$,
we get the following corollary:

Corollary 3 Given a set S of d strings of total length m, we can build a
data structure occupying $O(m \log m + n^\varepsilon)$ bits of space such that given any
text T of length n, we can find all occurrences of strings of S in T in time
$O(n\frac{\log d + \log\log_\sigma n + \log\log m}{\log_\sigma n} + occ)$, where $\varepsilon$ is any constant such that $0 < \varepsilon < 1$.

In the case $m \ge n$ we can get a better speedup by setting t = m:

Corollary 4 Given a set S of d strings of total length m, we can build a data
structure occupying $O(m \log m)$ bits of space such that given any text T of length
n, we can find all occurrences of strings of S in T in time $O(n\frac{\log d + \log\log m}{\log_\sigma m} + occ)$.



We note that in the case d = 1, the result of Corollary 3 is worse by a factor
$\log\log_\sigma n + \log\log m$ than that of Bille, which achieves a query time of $O(\frac{n}{\log_\sigma n} + occ)$.
However, the result of Bille does not extend naturally to d > 1. The
straightforward way of extending Bille's algorithm is to build d data structures
and to match the text against all the data structures in parallel. This however
would give a running time of $O(n\frac{d}{\log_\sigma n} + occ)$, which is worse than our running
time $O(n\frac{\log d + \log\log_\sigma n + \log\log m}{\log_\sigma n} + occ)$, which is linear in $\log d$ rather than d.

As for the technique of Fredriksson, in order to obtain query time $O(\frac{n}{\alpha} + occ)$,
it needs to use at least $\Omega(m\sigma^\alpha)$ space, which can be too much in case $\alpha$ is too
large. In the case of single pattern matching, we can even get a stronger result,
as we prove the following theorem:

Theorem 3 Given a string p of length m, we can build a data structure
occupying $O(m \log m)$ bits of space such that given any text T of length n, we can
find all occ occurrences of the string p in T in time $O(n(\frac{1}{m} + \frac{\log\sigma}{w}) + occ)$.

An important implication of this theorem is that single pattern matching in
optimal time $O(n\frac{\log\sigma}{w} + occ)$ is possible for strings of length $m \ge \frac{w}{\log\sigma}$.
Similarly to the case of Theorem 1, we can use the Four Russians technique to improve the
result of Theorem 3 in case m is too short:

Theorem 4 Given a string p of length m and an integer parameter $\alpha$, we can build a data structure
occupying $O(m \log m + \sigma^\alpha \log\alpha)$ bits of space such that given any text T of length
n, we can find all occ occurrences of p in T in time $O(\frac{n}{\alpha} + occ)$.

This last theorem matches the result achieved in Bille's algorithm.

3 Components 

Before we present the details of our main results we first present the main tools 
and components which are to be used in our solutions. In particular we will 
make use of several data structures and operations which exploit the power 
of the word-RAM model. We first describe some basic operations which will 
be explicitly used for implementing our algorithms. Then we describe some 
classical geometric and string processing oriented data structures which will be 
used as black-box components in our data structures. 

3.1 Bit parallel string processing 

Before we describe the basic bit-parallel operations, we first define how the
characters are packed in words. We assume that the pattern and the text are
packed in a similar way. Each character is encoded using $\log\sigma$ bits. The text
T is thus encoded using a bit array $B_T$ which occupies $n \log\sigma$ bits, which is
$\lceil n \log\sigma / w \rceil$ words. We thus assume that we have a representation of the text T
which fits in a word array $W_T$³. An important technical point is the
endianness, that is, the way the bits are ordered in a word, which influences the
way the characters are packed in memory. We basically have two possibilities:
either the bits in a word are ordered from the least to the most significant
(little endian) or the converse (big endian). Here we illustrate how a particular
character T[i] of the text is extracted. We only present the first case (little
endian) as the latter can easily be deduced from the former:

1. First compute $i_0 = (i \log\sigma) \bmod w$.

2. Then read the two words $W_0 = W_T[\lfloor i \log\sigma / w \rfloor]$ and
$W_1 = W_T[\lfloor (i + 1)\log\sigma / w \rfloor]$.

3. Finally, we distinguish two cases:

• If $\lfloor i \log\sigma / w \rfloor = \lfloor (i + 1)\log\sigma / w \rfloor$ (the character i does not span two
consecutive words), then return $(W_0 \gg i_0) \bmod 2^{\log\sigma}$.

• Otherwise (the character spans the two consecutive words $W_0$ and
$W_1$) we return $(W_0 \gg i_0) + ((W_1 \bmod 2^{\log\sigma - (w - i_0)}) \ll (w - i_0))$.
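
The extraction steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: Python integers stand in for w-bit machine words, `pack_text` is a hypothetical helper that builds the word array (with one spare word so that $W_1$ can always be read), `bits` plays the role of $\log\sigma$, and the part of the character read from $W_1$ is shifted left by $w - i_0$ bits before being added to the part read from $W_0$.

```python
def pack_text(chars, bits, w):
    """Pack character codes into w-bit words, least significant bits first."""
    total_bits = len(chars) * bits
    # One spare word so that reading W1 never runs past the array.
    words = [0] * ((total_bits + w - 1) // w + 1)
    mask = (1 << w) - 1
    for i, c in enumerate(chars):
        q, r = divmod(i * bits, w)
        words[q] |= (c << r) & mask
        if r + bits > w:                 # the character spills into the next word
            words[q + 1] |= c >> (w - r)
    return words


def extract_char(words, i, bits, w):
    """Return character i from the packed word array (little-endian bit order)."""
    i0 = (i * bits) % w                  # step 1: bit offset inside the first word
    q0 = (i * bits) // w
    q1 = ((i + 1) * bits) // w
    w0, w1 = words[q0], words[q1]        # step 2: read the two words
    if q0 == q1:                         # step 3, first case: no word boundary
        return (w0 >> i0) % (1 << bits)
    low_bits = w - i0                    # bits of the character stored in W0
    high = w1 % (1 << (bits - low_bits)) # remaining bits, stored in W1
    return (w0 >> i0) + (high << low_bits)


# Hypothetical example: 3-bit characters packed into 8-bit words.
chars = [5, 0, 7, 3, 6, 1, 2, 4, 7, 5, 3]
words = pack_text(chars, 3, 8)
```

With 3-bit characters and 8-bit words, character 2 occupies bits 6 to 8 of the bit array and therefore exercises the second case.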

It can easily be seen that the extraction of a character can be done in
constant time. However, in general we will want to operate on groups
of characters instead of manipulating characters one by one. This permits
much faster operations on strings. In particular we will make use of the
following lemma, whose proof is omitted and which can easily be implemented
using standard bit-parallel instructions.

Lemma 1 Given two strings of length $m \le \frac{w}{\log\sigma}$ characters, one can compare them
(for equality) in $O(1)$ time using bit-parallelism. Moreover, given two strings of
length m, one can compare them in time $O(m\frac{\log\sigma}{w})$.

MSB and LSB operations Our solutions for single string matching use
the special instruction MSB(x), which returns the most significant set bit in a
word, and similarly LSB(x), which returns the least significant set bit in a word.
Those two operations can be simulated in constant time using classical RAM 
operations (see [2l [Pol [7]). 

Lemma 2 The two functions MSB(x) and LSB(x) can be implemented in
$O(1)$ time provided that the bit-string x is of length $O(w)$ bits.
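
For experimentation, the two operations can be sketched in Python as below. This sketch relies on Python's built-in `int.bit_length` and is not the constant-time word-RAM simulation referred to above; the function names `msb` and `lsb` are chosen here for illustration, and both functions assume x > 0.

```python
def msb(x):
    """Index of the most significant set bit of x (x > 0)."""
    return x.bit_length() - 1


def lsb(x):
    """Index of the least significant set bit of x (x > 0)."""
    # x & -x isolates the lowest set bit of x.
    return (x & -x).bit_length() - 1
```

For example, for x = 0b101000 the most significant set bit has index 5 and the least significant set bit has index 3.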

³ Notice that when w is not a multiple of $\log\sigma$, a character could span a boundary between
two consecutive words.

Longest repetition matching We will make use of the following tool: given
a string p of length m and a string s of length $n \ge m$ where both strings are over
the same alphabet of size $\sigma$, we wish to have the following two operations:

1. Longest prefix repetition matching: find the largest i such that $p^i$ (p
repeated i times) is a prefix of s.

2. Longest suffix repetition matching: find the largest i such that $p^i$ is a
suffix of s.

We argue that both operations can be done in $O(n\frac{\log\sigma}{w})$ time. First consider the
computation of the longest prefix repetition of a string p of length m in a string
s of length n. We have two cases:

1. Suppose that $m \log\sigma > w/2$. In this case, it suffices to compare successively
$s[mi, m(i + 1) - 1]$ with p for increasing values of i until we reach $i = \lfloor n/m \rfloor$
or find a mismatch. By Lemma 1 each comparison takes $O(m\frac{\log\sigma}{w})$ time, and thus the whole
operation takes at most $O(n\frac{\log\sigma}{w})$ time.

2. Suppose that $m \log\sigma \le w/2$. In this case we first compute $k = \lfloor \frac{w}{m\log\sigma} \rfloor$
and then compute $p' = p^k$ of length $m' = mk$, and note that $w/2 < m' \log\sigma \le w$. Now we
first compare $s[m'j, m'(j + 1) - 1]$ with $p'$ for increasing values of j until we
reach $j = \lfloor n/m' \rfloor$ or find a mismatch. Clearly this step also takes $O(n\frac{\log\sigma}{w})$ time.
Now, we have determined that $jk \le i < (j + 1)k$. In the final step
we compute $q = s[m'j, m'(j + 1) - 1]$ and finally $r = q \oplus p'$ (where $\oplus$
denotes the xor operator) and let $t = LSB(r)$ (or $t = MSB(r)$, depending
on the endianness, that is, the way the processor orders the bits in its words).
Now clearly, t is the position of the first bit in which $p'$ and q differ, and
the first character in which $p'$ and q differ is precisely character
number $\lfloor t/\log\sigma \rfloor$. From there we deduce that $i = jk + \lfloor \lfloor t/\log\sigma \rfloor / m \rfloor$. The
computation of the LSB and the xor operator both take constant time.

The computation of the longest suffix repetition is symmetric to the computation
of the longest prefix repetition, except that we use the MSB operation instead of
LSB or vice versa, depending on the endianness.

Lemma 3 Given a string p of length m and a string s of length n where $n \ge m$,
the longest prefix (and suffix) repetition of p in s can be found in time $O(n\frac{\log\sigma}{w})$.
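
The longest prefix repetition computation can be sketched as follows. This is a simplified illustration under several assumptions: strings are lists of small integer character codes, Python integers stand in for packed machine words (so the sketch shows the logic, not the stated running time, since blocks are repacked on every comparison), characters are packed least significant bits first, and a trailing block shorter than $p'$ is handled by comparing it with the corresponding prefix of $p'$, a detail that the description above leaves implicit. The helper names are hypothetical. Taking k = 1 when p does not fit twice in a word makes the same code cover the first case.

```python
def lsb(x):
    """Index of the least significant set bit of x (x > 0)."""
    return (x & -x).bit_length() - 1


def pack(chars, bits):
    """Pack character codes into one integer, first character in the lowest bits."""
    value = 0
    for idx, c in enumerate(chars):
        value |= c << (idx * bits)
    return value


def longest_prefix_repetition(p, s, bits, w):
    """Largest i such that p repeated i times is a prefix of s."""
    m, n = len(p), len(s)
    k = max(1, w // (m * bits))          # copies of p glued into one block p'
    block = list(p) * k
    mb = m * k                           # length m' of the block p'
    packed_block = pack(block, bits)
    j = 0
    # Compare whole blocks of s against p' until a mismatch or the end of s.
    while mb * (j + 1) <= n and pack(s[mb * j: mb * (j + 1)], bits) == packed_block:
        j += 1
    # Finish inside the first block that does not fully match (possibly short).
    tail = s[mb * j: mb * (j + 1)]
    r = pack(tail, bits) ^ pack(block[:len(tail)], bits)
    matched_chars = len(tail) if r == 0 else lsb(r) // bits
    return j * k + matched_chars // m
```

For example, with p = [1, 2], 2-bit characters and w = 16, four copies of p fit in one block, and for s = [1, 2] * 5 + [1, 3, 0] one full block matches and one more copy of p is found inside the next, partial block, so the function returns 5.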

3.2 Data structures components 

For our results we will use several classical data structures which are illustrated 
with the following lemmata: 

Lemma 4 [55] Given a collection of n intervals over a universe U where for any
two intervals $s_1$ and $s_2$ we have either $s_1 \cap s_2 = s_1$, $s_1 \cap s_2 = s_2$ or $s_1 \cap s_2 = \emptyset$
(for any two intervals either one is included in the other or the two intervals are
disjoint), we can build a data structure which uses $O(n \log n)$ bits of space such
that for any point x, we can determine the interval which most tightly encloses
x (the smallest interval which encloses x) in $O(\log\log U)$ time.

For implementing the lemma, we store the set of interval endpoints in a
predecessor data structure, namely Willard's y-fast trie [28], which is a linear
space version of the van Emde Boas tree [37]. Those points divide the
universe of size U into at most 2n + 1 segments, and each segment points to the interval
which most tightly encloses the segment. A predecessor query then points
to the segment which in turn points to the relevant interval. This problem can
be thought of as a restricted 1D stabbing problem (in the general problem we do
not have the condition that for any two intervals either one is included in the
other or the two intervals are disjoint).
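
The reduction to predecessor search can be sketched as follows. This is a minimal sketch under stated assumptions: a sorted array with binary search (Python's `bisect`) replaces the y-fast trie, so queries take logarithmic rather than doubly logarithmic time; intervals are closed integer intervals given as (lo, hi) pairs; and the construction is deliberately naive. The class name is hypothetical.

```python
import bisect


class NestedIntervalStabbing:
    """Tightest enclosing interval for a family of nested or disjoint intervals."""

    def __init__(self, intervals):
        self.intervals = list(intervals)
        # Each boundary starts an elementary segment in which the set of
        # enclosing intervals does not change.
        self.starts = sorted({lo for lo, _ in self.intervals} |
                             {hi + 1 for _, hi in self.intervals})
        self.owner = [self._tightest(b) for b in self.starts]

    def _tightest(self, point):
        """Naive search for the smallest interval enclosing point."""
        best = None
        for lo, hi in self.intervals:
            if lo <= point <= hi and (best is None or hi - lo < best[1] - best[0]):
                best = (lo, hi)
        return best

    def query(self, x):
        """Return the smallest interval enclosing x, or None if there is none."""
        pos = bisect.bisect_right(self.starts, x) - 1   # predecessor of x
        return None if pos < 0 else self.owner[pos]


# Hypothetical example family: (2, 5) and (7, 8) nest inside (0, 9).
stab = NestedIntervalStabbing([(0, 9), (2, 5), (3, 4), (7, 8)])
```

Here a query for the point 4 finds the predecessor boundary 3, whose segment points to the interval (3, 4).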

Lemma 5 Given a collection S of n strings of arbitrary lengths and a function
f from S into [0, m - 1], we can build a data structure which uses $O(n \log m)$
bits and which computes f(x) for any $x \in S$ in time $O(|x|/w)$ (where |x|
is the length of x in bits). When queried for any $y \notin S$, the function returns an arbitrary
value from the set f(S).

This result can easily be obtained using minimal perfect hashing [15, 19]. Though
perfect hashing is usually defined for fixed $O(w)$-bit integers, a standard string
hash function [13] can be used to first reduce the strings to integers before
constructing the minimal perfect hashing on the generated integers.

Lemma 6 [9, Theorem 1] Given a collection S of n strings of variable lengths
occupying a memory area of m characters (the strings can possibly overlap), we
can build an index which uses $O(n \log m)$ bits so that given any string x, we can
find the string $s \in S$ which is the longest among all the strings of S which are
a prefix of x in time $O(|x|/w + \log n)$ (where |x| is the length of x in bits). More
precisely, the data structure returns $prrank_S(s)$. Moreover, the data structure
is able to tell whether x = s.

This result is obtained using a string B-tree [14] combined with an LCP
array and a compacted trie [21] built on the set of strings, and setting the block
size of the string B-tree to $O(1)$. The following lemma is symmetric to the
previous one.

Lemma 7 Given a collection S of n strings of variable lengths occupying a
memory area of m characters (the strings can possibly overlap), we can
build an index which uses $O(n \log m)$ bits so that given any string x, we can
find the string $s \in S$ which is the longest among all the strings of S which are
a suffix of x in time $O(|x|/w + \log n)$ (where |x| is the length of x in bits). More
precisely, the data structure returns $surank_S(s)$. Moreover, the data structure
is able to tell whether x = s.

Lemma 8 [8] Given a set of n rectangles in the plane, we can build a data
structure which uses $O(n \log n)$ bits of space so that given any point [v, z], we
can report all the k rectangles which enclose that point in time
$O(\log n + k)$.

The problem solved by Lemma 8 is called the 2D stabbing problem, or sometimes
planar point enclosure. The lemma uses the best linear space solution
to the problem, which is due to Chazelle [8] (and which is optimal according to the
lower bound in [24]).

4 Multiple string matching without tabulation 

4.1 Overview 

The goal of this section is to show how we can simulate the running of the AC
automaton [1] by processing the characters of the scanned text in blocks of
b characters. The central idea relies on a reduction of the problem of
dictionary matching to the 1D and 2D stabbing problems, in addition to the
use of standard string data structures, namely string B-trees, suffix arrays and
minimal perfect hashing on strings. At each step, we first read b characters
of the text, find the matching patterns which end at one of those characters
and finally jump to the state which would have been reached after reading
the b characters by the AC automaton (thereby simulating all next and fail
transitions which would have been traversed by the standard AC automaton
for the b characters). Finding the matching patterns is reduced to the 2D
stabbing problem, while jumping to the next state is reduced to the 1D stabbing
problem. The geometric approach has already been used for the dictionary matching
problem and for text pattern matching algorithms in general. For example,
it has recently been used in order to devise compressed indexes for substring
matching [18] [22] [§]. Even more recently the authors of [26] have presented a
compressed index for dictionary matching which uses a reduction to the 2D stabbing
problem.

4.2 The data structure 

We now describe the data structure in more detail. Given the set S of d
patterns, we denote by P the set of the prefixes of the patterns in S (note that
$|P| \le m + 1$). It is a well-known fact that there is a bijection between
the set P and the set of states of the AC automaton. We use the same state
representation as the one used in [4]. That is, we first sort the states of the
automaton in the suffix-lexicographic order of the prefixes to which they
correspond, attributing increasing numbers to the states from the interval [0, m].
Thus the state corresponding to the empty string gets the number 0, while the
state corresponding to the greatest element of P (in suffix-lexicographic order)
gets the largest number, which is at most m. We define state(p) as the state
corresponding to the prefix $p \in P$.
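
The state numbering can be illustrated with a short sketch. It simply collects the prefixes of the patterns and sorts them by their reversed form, which realizes the suffix-lexicographic order; the function name `build_states` and the sample patterns are assumptions made for illustration.

```python
def build_states(patterns):
    """Map every prefix of every pattern to its state number.

    States are numbered by the rank of the prefix in suffix-lexicographic
    order, that is, by comparing the reversed prefixes.
    """
    prefixes = {pat[:k] for pat in patterns for k in range(len(pat) + 1)}
    ordered = sorted(prefixes, key=lambda s: s[::-1])
    return {prefix: number for number, prefix in enumerate(ordered)}


# Hypothetical dictionary of total length m = 7.
state = build_states(["ab", "bab", "bb"])
```

With these three patterns there are seven distinct prefixes, the empty prefix gets the number 0, and the prefix "bb" gets the largest number, 6, which is at most m = 7.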

Now, the characters of the text are scanned in blocks of b
characters. For finding occurrences of the patterns in a text T, we do $\lceil n/b \rceil$ steps.
At each step $i \in [0, \lceil n/b \rceil - 1]$ we do three actions:

• Read b characters of the text, $T[ib, (i + 1)b - 1]$ (or $n - ib \le b$ characters
of the text, $T[ib, n - 1]$, in the last step).

• Identify all the occurrences of patterns which end at a position j of the
text such that $j \in [ib, (i + 1)b)$ ($j \in [ib, n)$ in the last step).

• If not in the last step, go to the next state, corresponding to the longest
element of P which is a suffix of $T[0, (i + 1)b - 1]$.
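
The three actions can be sketched with naive stand-ins for the data structures. In this sketch, which is an illustration and not the paper's implementation, the current state is represented directly by the corresponding prefix string, occurrences are found by brute force inside the window formed by the current state followed by the block (in place of the 2D stabbing query), and the next state is found by trying suffixes of that window from longest to shortest (in place of the 1D stabbing query). The function name `block_scan` is hypothetical.

```python
def block_scan(patterns, text, b):
    """Report (end position, pattern) pairs by scanning text in blocks of b characters."""
    prefixes = {p[:k] for p in patterns for k in range(len(p) + 1)}
    state = ""            # longest element of P that is a suffix of the text read so far
    occurrences = []
    for start in range(0, len(text), b):
        block = text[start:start + b]        # action 1: read up to b characters
        window = state + block
        offset = start - len(state)          # text position of window[0]
        # Action 2: any occurrence ending in the block lies inside the window,
        # because its part before the block is an element of P and a suffix
        # of the text read so far, hence no longer than the current state.
        for end in range(len(state), len(window)):
            for p in patterns:
                if end + 1 >= len(p) and window[end + 1 - len(p):end + 1] == p:
                    occurrences.append((offset + end, p))
        # Action 3: the new state is the longest suffix of the window in P
        # (the empty string is always in P, so the search always succeeds).
        state = next(window[k:] for k in range(len(window) + 1)
                     if window[k:] in prefixes)
    return occurrences
```

For the patterns "he", "she", "his", "hers" and the text "ushers", any block length b reports "he" and "she" ending at position 3 and "hers" ending at position 5.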

The details of the implementation of each of the last two actions is given in 
sections IQlandHUl 

Our AC automaton representation has the following components: 

1. An array A which contains the concatenation of all of the patterns. This 
array clearly uses mb bits of space. 

2. Let Po<j<6 be the set of prefixes of S of lengths in [1,6]. We use an 
instance of lemma [6l which we denote by B\ and in which we store the 
set Po<i<b (by means of pointers into the array A). Clearly B\ uses 
0(dblogm) — 0(m\ogm) bits of space (we have db elements stored in 
B\ and pointers into A take log to bits). We additionally store a vector 
of |Po<i<fc| < db elements which we denote by T\ and which associates 
an integer in [0,to) with each element stored in B\. The table T\ uses 
O(dblogm) — 0(m log to) 

3. We use an instance of lemma [SJ, which we denote by $B_2$ and in which
we store all the suffixes of strings in $P$ (or equivalently all factors of the
strings in $S$) of length $b$ and, for each suffix, a pointer to its ending
position in the array $A$ (if the same factor occurs multiple times in $S$ we
store it only once). As we have at most $m$ elements in $P$ and each pointer
(into the array $A$) to a factor can be encoded using $O(\log m)$ bits, we
conclude that $B_2$ uses at most $O(m \log m)$ bits of space.

4. We use an instance of lemma [7], which we denote by $B_3$ and in which we
store all the suffixes of strings of $S$ of lengths in $[1, b]$ (we denote that
set by $U_{0<i\le b}$). It can easily be seen that $B_3$ also uses
$O(db \log m) = O(m \log m)$ bits of space.

5. We use a 1D stabbing data structure (lemma [3]) in which we store $m$
segments, where each segment corresponds to a state of the automaton. This
data structure, which uses $O(m \log m)$ bits of space, is used in order to
simulate the transitions in the AC automaton. We also store a vector of
integers of size $m$, which we denote by $T_2$ and which associates an integer
with each interval stored in the 1D stabbing data structure. The table $T_2$
uses $O(m \log m)$ bits of space.

6. We use a 2D stabbing data structure (lemma [S]) in which we store up to
$db$ rectangles. The space used by this data structure is
$O(db \log(db)) = O(m \log m)$ bits. We also use a table $T_3$ which stores
triplets of integers associated with each rectangle. The table $T_3$ also uses
$O(db \log m) = O(m \log m)$ bits.

We defer the details about the contents of each component to the full version.
Central to the working of our data structure is the following technical lemma:

Lemma 9 Let $X$ be a set of strings. For any two strings $x \in X$ and
$y \in X$:

• $\mathrm{prrank}_X(y) \in [\mathrm{prrank}_X(x), \mathrm{prrank}_X(x) + \mathrm{prcount}_X(x) - 1]$
iff $x$ is a prefix of $y$.

• $\mathrm{surank}_X(y) \in [\mathrm{surank}_X(x), \mathrm{surank}_X(x) + \mathrm{sucount}_X(x) - 1]$
iff $x$ is a suffix of $y$.

The proof of the lemma is omitted.
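
The statement can be checked mechanically on small sets. The sketch below
assumes that the rank of a string is its 0-based position in (suffix-)
lexicographic order and that the count of a string is the number of elements of
$X$ having it as a prefix (suffix), itself included; the helper names
`rank_tables` and `in_range` are hypothetical.

```python
def rank_tables(X, suffix=False):
    """Return (rank, count) dictionaries for the strings of X.

    rank[x]  : 0-based position of x in lexicographic order
               (suffix-lexicographic order when suffix=True)
    count[x] : number of elements of X having x as a prefix (suffix)
    """
    key = (lambda s: s[::-1]) if suffix else (lambda s: s)
    order = sorted(set(X), key=key)
    rank = {x: r for r, x in enumerate(order)}
    related = str.endswith if suffix else str.startswith
    count = {x: sum(related(y, x) for y in order) for x in order}
    return rank, count


def in_range(rank, count, x, y):
    """True iff rank[y] lies in [rank[x], rank[x] + count[x] - 1]."""
    return rank[x] <= rank[y] <= rank[x] + count[x] - 1
```

The property holds because the strings sharing a given prefix are contiguous in
sorted order and start with that prefix itself; reversing the strings gives the
suffix case.
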
4.3 Simulating transitions 

We use a representation of states similar to the one used in [4]. That is,
each state of the automaton corresponds to a prefix $p \in P$ and is
represented as an integer $\mathrm{state}(p) = \mathrm{surank}_P(p)$. The main
idea for accelerating transitions is to read the text in blocks of $b$
characters and then find the destination state attained after reading those
$b$ characters using $B_1$, $T_1$, $B_2$, $T_2$ and the 1D stabbing data
structure. More precisely, being at a state $\mathrm{state}(p)$ and after
reading the next $b$ characters of the text, which form a string $q$, we have
to find the next state, which is the state $\mathrm{state}(x)$ such that
$x \in P$ is the longest element of $P$ which is a suffix of $pq$. For that
purpose the 1D stabbing data structure is used in combination with $B_2$
(which is queried on the string $q$) in order to find $\mathrm{state}(x)$ in
case $|x| \ge b$. Otherwise, if no such $x$ is found, the data structure $B_1$
is used to find $\mathrm{state}(x)$, where $|x| < b$. The following lemma
summarizes the time and the space of the data structures needed to simulate a
transition.

Lemma 10 We can build a data structure occupying $O(m \log m)$ bits of space
such that if the automaton is in a state $t_i$, the state $t_{i+b}$ reached
after doing all the transitions on $b$ characters can be computed in
$O(\log d + \log b + \log\log m + \frac{b \log \sigma}{w})$ time.

The current state of the AC automaton is represented as a value
$cur \in [0, m]$. At the beginning the automaton is at state $cur = 0$, and we
read the text in blocks of $b$ characters at each step. At the end of each step
we have to determine the next state reached by the automaton, which is
represented by the number $next \in [0, m]$. We now show how the transitions of
the AC automaton for a block of $b$ characters are simulated. Suppose that we
are at step $i$ and the automaton is in the state $\mathrm{state}(p)$
corresponding to a prefix $p$. Now we have to read the substring
$q = T[ib, (i+1)b - 1]$, and the next state to jump to after reading $q$ is the
state $\mathrm{state}(x)$ corresponding to the longest element $x$ of $P$ which
is a suffix of $pq$.

For simulating transitions we use $B_1$, $T_1$, $B_2$, $T_2$ and the 1D
stabbing data structure. The table $T_1$ associates with each element of
$P_{0<i\le b}$ (each element of $P$ whose length is in $[1, b]$), sorted in
suffix-lexicographic order, the identifier of the state to which it
corresponds. That is, for each $x \in P_{0<i\le b}$ we set
$T_1[\mathrm{surank}_{P_{0<i\le b}}(x)] = \mathrm{state}(x)$. We recall that
given any element $x \in P_{0<i\le b}$, $\mathrm{surank}_{P_{0<i\le b}}(x)$ can
be obtained by querying $B_1$ for the element $x$.

The 1D stabbing data structure (lemma [3]), which is built on numbers
occupying $2\log(m+1)$ bits each, stores $m$ intervals, each of which is
defined by two points, where each point is defined by a number which occupies
$2\log(m+1)$ bits. Let $x \in P$ be decomposed as $x = p'q'$, where $q'$ is the
suffix of $x$ of length $b$ and $p'$ is the prefix of $x$ of length $|x| - b$.
Let $ID(q')$ be the pointer associated with $q'$ in $B_2$ (recall that $B_2$
associates a unique pointer into $A$ with each occurring factor $q'$ of
elements of $S$). We store in the 1D stabbing data structure the interval
$[I_0, I_1]$, where $I_0 = ID(q') \cdot \mathrm{state}(p')$ and
$I_1 = ID(q') \cdot (\mathrm{state}(p') + \mathrm{sucount}_P(p') - 1)$ (recall
that $\mathrm{sucount}_P(p')$ is the number of elements of $P$ which have $p'$
as a suffix). Here $\cdot$ is read as the concatenation of the bit
representations of the two numbers, which is why each point occupies
$2\log(m+1)$ bits. The 1D stabbing data structure naturally associates a unique
integer identifier from $[0, m]$ with each interval stored in it. We
additionally use a table $T_2$ of size $m$ indexed by the interval identifiers.
More precisely, let $j$ be the identifier corresponding to the interval
associated with the state $\mathrm{state}(p)$ for $p \in P$. We let
$T_2[j] = \mathrm{state}(p)$. That way, once we have found a given interval
from the 1D stabbing data structure, we can index into table $T_2$ in order to
find the corresponding state.

Queries happen in the following way. At step $i$, we are at state
$\mathrm{state}(p)$ corresponding to a prefix $p$, we are to read the sequence
$q = T[ib, (i+1)b - 1]$, and we must find the longest element of $P$ which is a
suffix of $pq$. For that, we do the following steps:

1. We first query $B_2$ for the string $q$, which returns a unique identifier
$ID(q')$, which is in fact a pointer to the ending position of a factor $q'$.
Now we compare $q'$ with $q$. If they are not equal, we go to step 4; otherwise
we continue with the next step.

2. We query the 1D stabbing data structure for the point
$ID(q) \cdot \mathrm{state}(p)$. This query returns the interval (identified by
a variable $j$) which most tightly encloses the point
$ID(q) \cdot \mathrm{state}(p)$, if it exists. This interval (if it exists)
corresponds to an element $x = p'q'$ of $P$ such that $q' = q$ and $p'$ is the
longest suffix of $p$ for which $p'q \in P$. If the query returns no interval,
we conclude that we have no element of $P$ of length at least $b$ which is a
suffix of $pq$ and go to step 4; otherwise we continue with the next step.

3. We retrieve $T_2[j]$, which gives us the destination state and concludes the
transition.

4. At this step we are sure that no element of $P$ of length at least $b$ is a
suffix of $pq$. We thus do a query on $B_1$ for the string $q$ in order to find
the longest element of $P$ which is a suffix of $pq$. Note that this element
must be of length less than $b$ and thus must be stored in $B_1$ and also must
be a suffix of $q$. Let $ID(q)$ be the identifier of the returned element.

5. By reading $T_1$ we retrieve the identifier of the destination state, which
is given by $T_1[ID(q)]$. This concludes the transition.
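
The five steps can be illustrated with naive stand-ins for the components. In
the sketch below a Python dictionary plays the role of $B_2$, a linear scan
plays the role of the 1D stabbing query, and a suffix loop plays the role of
$B_1$ and $T_1$; the concatenation $ID(q') \cdot \mathrm{state}(p')$ is assumed
to be encoded as `ID * |P| + state`. The names `build_index` and
`block_transition` are hypothetical, and the sketch only handles full blocks of
exactly $b \ge 1$ characters.

```python
def build_index(patterns, b):
    """Naive stand-ins for B2, T2 and the 1D stabbing structure (b >= 1)."""
    P = sorted({s[:k] for s in patterns for k in range(len(s) + 1)},
               key=lambda s: s[::-1])                 # suffix-lexicographic order
    state = {x: r for r, x in enumerate(P)}           # state(x) = surank_P(x)
    sucount = {x: sum(y.endswith(x) for y in P) for x in P}
    width = len(P)                                    # radix used for "ID . state"
    factor_id = {}                                    # factor of length b -> ID
    intervals = []                                    # (low, high, state(x))
    for x in P:
        if len(x) >= b:
            head, tail = x[:len(x) - b], x[len(x) - b:]
            fid = factor_id.setdefault(tail, len(factor_id))
            low = fid * width + state[head]
            intervals.append((low, low + sucount[head] - 1, state[x]))
    return {"b": b, "state": state, "width": width,
            "factor_id": factor_id, "intervals": intervals}


def block_transition(index, cur, q):
    """State reached from state `cur` after reading a full block q (len(q) == b)."""
    b, state = index["b"], index["state"]
    if q in index["factor_id"]:                       # step 1: q is a known factor
        point = index["factor_id"][q] * index["width"] + cur
        hits = [iv for iv in index["intervals"] if iv[0] <= point <= iv[1]]
        if hits:                                      # steps 2-3: tightest interval
            return min(hits, key=lambda iv: iv[1] - iv[0])[2]
    for length in range(b - 1, -1, -1):               # steps 4-5: short suffix of q
        tail = q[len(q) - length:]
        if tail in state:
            return state[tail]
```

The empty string always belongs to $P$, so the fallback loop always returns a
state.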

We now give a formal proof of lemma [10].

Proof. We now prove that the above algorithm effectively simulates $b$
consecutive transitions in the automaton. Recall that we are looking for the
state corresponding to the longest element $x \in P$ which is a suffix of
$pq$. After we have read the string $q$, we query the data structure $B_2$ to
retrieve a pointer $ID(q')$ to a string $q'$ which is a factor of some string
in $S$ (or equivalently a suffix of some element in $P$). Then we compare $q$
with $q'$ in time $O(b \log \sigma / w)$. Now we have two cases:

• The comparison is not successful. We conclude that no prefix in $P$ has $q$
as a suffix and hence the element $x \in P$ must be shorter than $b$ (otherwise
it would have had $q$ as a suffix). That means that $x$ is a suffix of $q$
($x$ is a suffix of $pq$ shorter than $q$) and hence has length less than $b$.
Hence we go to step 4 to query $B_1$ for the string $q$ in order to retrieve
$x$.

• The comparison is successful, in which case we know there exists at least
one element of $P$ which has $q$ as a suffix. Now we go to step 2, querying
the 1D stabbing data structure for the point
$K = ID(q) \cdot \mathrm{state}(p)$. The query returns an interval
$[I_0, I_1]$, where $I_0 = ID(q') \cdot \mathrm{state}(p')$ and
$I_1 = ID(q') \cdot (\mathrm{state}(p') + \mathrm{sucount}_P(p') - 1)$ for some
prefix $p'$ and factor $q'$. Now it can easily be proven that $q' = q$ and
that $p'q$ is the longest element of $P$ which is a suffix of $pq$. This is
proved by contradiction. By lemma [5] we have that $p'$ must be a suffix of
$p$. Suppose that the longest suffix is some $p'' \ne p'$ having an associated
interval $[J_0, J_1]$ with $K \in [J_0, J_1]$. By definition $p'$ is a suffix
of $p''$ and thus by the same lemma $[J_0, J_1] \subseteq [I_0, I_1]$, which
contradicts the fact that $[I_0, I_1]$ is, among all the intervals stored in
the 1D data structure, the one which most tightly encloses $K$ (which is
implied by lemma [2]).

Now, in the first case, we go to step 4 in order to find the longest prefix in
$P$ which is a suffix of $q$. In the second case, we go to step 2, looking
among the elements which have $q$ as a suffix for the longest one which is a
suffix of $pq$. If the search is unsuccessful, we conclude that no such element
$x$ exists, and thus $x$ must be shorter than $q$, so we go to step 4 to find
the longest prefix in $P$ which is a suffix of $q$.

The total space usage is clearly $O(m \log m)$ bits, as each of $B_1$, $B_2$,
$T_1$, $T_2$ and the 1D stabbing data structure uses $O(m \log m)$ bits.

Concerning the query time, it can easily be seen that steps 3 and 5 take
constant time, step 1 takes time $O(\frac{b \log \sigma}{w})$, step 2 takes
time $O(\log\log m)$ and finally step 4 takes
$O(\frac{b \log \sigma}{w} + \log d + \log b)$ time. Summing up, the total time
for a transition is
$O(\frac{b \log \sigma}{w} + \log d + \log b + \log\log m)$. $\square$

4.4 Identifying matching occurrences

In order to identify matching patterns, the 2D stabbing data structure is used
in combination with $B_3$.

Lemma 11 Given a parameter $b$ and a set $S$ of variable-length strings of
total length $m$ characters over an alphabet of size $\sigma$, we can build a
data structure occupying $O(m \log m)$ bits of space such that if the automaton
is at a state $t_i$ after reading $i$ characters of a text $T$, all the
$occ_i$ occurrences of patterns in $T$ which end at any position in
$T[i, i+b]$ (or $T[i, |T| - 1]$ if $i + b > |T|$) and begin at any position in
$T[0, i]$ can be computed in
$O(\log d + \log b + \frac{b \log \sigma}{w} + occ_i)$ time.

In order to find the matching pattern occurrences at each step, we use $B_3$,
the table $T_3$ and the 2D stabbing data structure. Initially the automaton is
at state 0; we read the first $b$ characters of the text, $T[0, b)$, and must
recognize all occurrences which end at any position $j \in [0, b)$. Note that
in this first step, any occurrence must end at position $b - 1$ (this is the
case because we have assumed that $b$ is no longer than the length of the
shortest pattern). Then at each subsequent step $i$, we read a block
$T[ib, (i+1)b)$ (or the block $T[ib, n)$ in the last step) and must recognize
all the occurrences which end at a position $j \in [ib, (i+1)b)$ (or
$j \in [ib, n)$ in the last step). Suppose that at some step $i$ we are at a
state $\mathrm{state}(p)$ corresponding to a prefix $p$ and we are to read the
block $q = T[ib, (i+1)b)$ ($q = T[ib, n)$ in the last step). It is clear that
any matching occurrence must be a substring of $pq$ (the string $p$
concatenated with the string $q$) and moreover, that substring must end inside
the string $q$. In other words, any occurrence $x$ is such that $x = p'q'$,
where $p'$ is a suffix of $p$ and $q'$ is a prefix of $q$.

Identifying the patterns involves first computing a point $(x_p, y_q)$, where
$x_p = \mathrm{state}(p)$ and $y_q = \mathrm{prrank}_{U_{0<i\le b}}(q)$ is
computed by querying $B_3$ for $q$, then querying the 2D stabbing data
structure in order to get, as integer identifiers, all the rectangles which
enclose $(x_p, y_q)$, where each reported rectangle represents one occurrence
of one of the patterns. Finally, using the table $T_3$, we can get the matching
pattern identifiers along with their starting and ending positions. We now
describe how the set of rectangles is built. For each pattern $s \in S$ of
length $\ell$ we insert $b$ rectangles. Namely, for each $i \in [1, b]$ we
insert the rectangle which is defined by the two intervals:

• Let $p'$ be the prefix of $s$ of length $\ell - i$. Let
$R = \mathrm{state}(p') = \mathrm{surank}_P(p')$ be the state corresponding to
$p'$ (or equivalently the rank of $p'$ in suffix-lexicographic order relative
to the set $P$) and let $c = \mathrm{sucount}_P(p')$ be the number of elements
of $P$ which have the string $p'$ as a suffix. The first interval is given by
$[R, R + c - 1]$.

• Let $q'$ be the suffix of $s$ of length $i$. Let
$ID(q') = \mathrm{prrank}_{U_{0<i\le b}}(q')$ be the unique identifier returned
by $B_3$ for $q'$ (recall that $B_3$ stores all suffixes of lengths at most $b$
of elements of $S$). Let $c = \mathrm{prcount}_{U_{0<i\le b}}(q')$ be the
number of elements of $U_{0<i\le b}$ which have $q'$ as a prefix. The second
interval is given by $[ID(q'), ID(q') + c - 1]$.

The 2D stabbing data structure returns a unique identifier
$j \in [0, db - 1]$ corresponding to each rectangle. Additionally, with each
rectangle we associate a triplet $(I, |p'|, |q'|)$, which is stored in table
$T_3$ at position $T_3[j]$, where $I \in [0, d - 1]$ is the unique integer
identifier of the pattern $s$. This table thus uses
$O(db \log m) = O(m \log m)$ bits of space.

Queries happen in the following way: suppose that we are at state
$\mathrm{state}(p)$ corresponding to a prefix $p$ and we are to read the block
$q = T[ib, (i+1)b)$. We first query $B_3$ for the string $q$, giving us an
identifier $ID(q') = \mathrm{prrank}_{U_{0<i\le b}}(q')$ corresponding to the
longest element $q' \in U_{0<i\le b}$ such that $q'$ is a prefix of $q$. Then
we do a 2D stabbing query for the point $(\mathrm{state}(p), ID(q'))$. Now for
every found rectangle identified by an integer $j$, we retrieve the triplet
$(I, |p'|, |q'|)$ from $T_3[j]$. The reported string has identifier $I$ and
matches the text at positions $[ib - |p'|, ib + |q'| - 1]$. We now give a
formal proof of lemma [11].

Proof. We now prove that the above procedure reports all (and only) matching
occurrences. For that it suffices to prove that there exists a bijection
between occurrences and reported rectangles. It is easy to see that each
occurrence $s$ which begins in $T[0, i]$ and ends in $T[i, i+b]$ can be
decomposed as $s = p'q'$, where $p'$ is a suffix of $T[0, i-1]$ ($p'$ can
possibly be the empty string) and $q'$ is a prefix of $T[i, i+b]$. Then as
$s \in S$, we can easily deduce that $q' \in U_{0<i\le b}$ and $p' \in P$. It
is also easy to see that $p'$ is a suffix of $p$. Let $q \in U_{0<i\le b}$ be
the longest element of $U_{0<i\le b}$ which is a prefix of $T[i, i+b]$. It is
easy to see that $q'$ must be a prefix of $q$. Thus according to lemma [5] we
have that
$\mathrm{surank}_P(p) \in [\mathrm{surank}_P(p'), \mathrm{surank}_P(p') + \mathrm{sucount}_P(p') - 1]$
and
$\mathrm{prrank}_{U_{0<i\le b}}(q) \in [\mathrm{prrank}_{U_{0<i\le b}}(q'), \mathrm{prrank}_{U_{0<i\le b}}(q') + \mathrm{prcount}_{U_{0<i\le b}}(q') - 1]$.
Now recall that $B_3$ returns $\mathrm{prrank}_{U_{0<i\le b}}(q)$ and the 2D
stabbing query is done precisely on the point
$(\mathrm{state}(p), \mathrm{prrank}_{U_{0<i\le b}}(q))$, which will thus
return all (and only) the rectangles corresponding to occurrences. $\square$
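
The rectangle construction and the query can be illustrated with naive
stand-ins: sorted Python lists give the two rank functions, and a linear scan
over the rectangles replaces the 2D stabbing structure. The names
`build_reporter` and `report_block` are hypothetical, the state is passed as
the prefix string $p$ rather than as an integer, and the sketch assumes
distinct patterns and $1 \le b \le$ the length of the shortest pattern.

```python
def build_reporter(patterns, b):
    """Naive stand-ins for B3, T3 and the 2D stabbing structure."""
    P = sorted({s[:k] for s in patterns for k in range(len(s) + 1)},
               key=lambda s: s[::-1])                  # suffix-lexicographic order
    U = sorted({s[len(s) - i:] for s in patterns for i in range(1, b + 1)})
    surank = {x: r for r, x in enumerate(P)}
    prrank = {x: r for r, x in enumerate(U)}
    sucount = {x: sum(y.endswith(x) for y in P) for x in P}
    prcount = {x: sum(y.startswith(x) for y in U) for x in U}
    rects = []     # (x_lo, x_hi, y_lo, y_hi, (pattern id, |p'|, |q'|))
    for pid, s in enumerate(patterns):
        for i in range(1, b + 1):
            head, tail = s[:len(s) - i], s[len(s) - i:]
            rects.append((surank[head], surank[head] + sucount[head] - 1,
                          prrank[tail], prrank[tail] + prcount[tail] - 1,
                          (pid, len(head), len(tail))))
    return surank, prrank, rects


def report_block(reporter, p, q, block_start):
    """Occurrences (pattern id, start, end) that end inside block q, given state p."""
    surank, prrank, rects = reporter
    # longest element of U that is a prefix of q (the query answered by B3)
    best = next((q[:k] for k in range(len(q), 0, -1) if q[:k] in prrank), None)
    if best is None:
        return []
    x, y = surank[p], prrank[best]
    return [(pid, block_start - lp, block_start + lq - 1)
            for x_lo, x_hi, y_lo, y_hi, (pid, lp, lq) in rects
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi]
```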

By combining lemma [11] and lemma [10] we directly get theorem [1] by setting
$b = y$, where $y$ is the length of the shortest string in the set $S$. At any
step $I$, we do the following:

1. Read the characters $T[i, j]$, where we set $i = Ib$ and we set
$j = (I+1)b - 1$ if $n > (I+1)b - 1$ and $j = n - 1$ otherwise.

2. Recognize all the pattern occurrences which start at any position
$i' \le i$ and which terminate at positions in $[i, j]$, using lemma [11].

3. Increment the step by setting $I = I + 1$. Then if $Ib \ge n$, stop the
algorithm immediately.

4. Do a transition using lemma [10] and return to action 1.

4.5 Analysis 

Theorem [1] is obtained by combining lemma [10] with lemma [11]. Namely, by
setting $b = y$ in both lemmata, where $y$ is the length of the shortest
pattern in $S$, we can simulate the running of the automaton in
$\lceil n/y \rceil$ steps, at each step $i$ spending
$O(\log d + \log b + \log\log m + \frac{b \log \sigma}{w} + occ_i)$ time to
find the $occ_i$ matching occurrences (through lemma [11]) and
$O(\log d + \log b + \log\log m + \frac{b \log \sigma}{w})$ time to simulate
the transitions (through lemma [10]). Summing up over all the
$\lceil n/y \rceil$ steps, we get the query time stated in the theorem. We can
now formally analyze the correctness and space usage of theorem [1].



Correctness The correctness of the query is immediate. It can easily be seen
that at each step $I$, we are recognizing all occurrences which end at any
position in $[Ib, (I+1)b - 1]$ (or $[Ib, n - 1]$ in the last step). That is,
at any step $I$ we can use lemma [11] to recognize all the occurrences which
end at any position in $[Ib, (I+1)b - 1]$ (or $[Ib, n - 1]$ in the last step)
and start at any position $i' \le Ib$. Also at each step $I$, we are at the
state $st_{Ib}$ reached after reading $Ib$ characters of the text, and lemma
[10] permits us to jump to the state $st_{(I+1)b}$, which is reached after
reading $b$ more characters.



Space usage Summing up, the total space usage of the theorem is
$O(m \log m)$ bits, as both lemma [10] and lemma [11] use $O(m \log m)$ bits.



4.6 Consequences 

Theorem [1] states that we can use $O(m \log m)$ bits of space to identify all
occurrences of patterns of length at least $y$ in a text $T$ of length $n$ in
time $O(n(\frac{\log d + \log y + \log\log m}{y} + \frac{\log \sigma}{w}) + occ)$.
If we suppose that all the patterns are of length at least $w$ bits
($\frac{w}{\log \sigma}$ characters), then by setting
$y = \frac{w}{\log \sigma}$, we obtain an index which answers queries in time
$O(n(\frac{\log \sigma (\log d + \log \frac{w}{\log \sigma} + \log\log m)}{w} + \frac{\log \sigma}{w}) + occ)$.
As we have $\log m \le w$ and $\log \frac{w}{\log \sigma} \le \log w$, the
query time simplifies to
$O(n \frac{\log \sigma (\log d + \log w)}{w} + occ)$. This gives us corollary
[1]. An important implication of theorem [1] is that it is possible to attain
the optimal $O(n \frac{\log \sigma}{w} + occ)$ query time in case the patterns
are of sufficient minimal length. Namely, if each pattern is of length at
least $(\log d + \log w)$ words (that is,
$\frac{w(\log d + \log w)}{\log \sigma}$ characters), then by setting
$y = \frac{w(\log d + \log w)}{\log \sigma}$ in theorem [1] we obtain a query
time of $O(n \frac{\log \sigma}{w} + occ)$. This gives us corollary [2].
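
To get a feel for these bounds, the following sketch evaluates their leading
terms numerically. All hidden constants are set to 1, logarithms are taken in
base 2, and the occ term is left out; these are assumptions made purely for
illustration, and the function names and sample parameters are hypothetical.

```python
from math import log2


def per_char_cost(d, y, m, sigma, w):
    """Leading term of the theorem [1] bound per text character (constants dropped)."""
    return (log2(d) + log2(y) + log2(log2(m))) / y + log2(sigma) / w


def optimal_cost(sigma, w):
    """Per-character cost of merely reading the text, log(sigma) / w."""
    return log2(sigma) / w


def min_length_corollary1(sigma, w):
    """Shortest pattern length, in characters, assumed by corollary [1]."""
    return w / log2(sigma)


def min_length_corollary2(d, sigma, w):
    """Shortest pattern length, in characters, assumed by corollary [2]."""
    return w * (log2(d) + log2(w)) / log2(sigma)
```

For instance, with $w = 64$, $\sigma = 4$, $d = 1024$ and $m = 2^{20}$, the two
thresholds are 32 and 512 characters; at $y = 512$ the evaluated cost is within
a small constant factor of $\log \sigma / w$, while at $y = 32$ it is not.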



5 Tabulation based solution for multiple-string 
matching 

We now prove theorem [2]. A shortcoming of theorem [1] is that it gives no
speedup in case the length of the shortest string in $S$ is too short. In this
case we resort to tabulation in order to accelerate matching of short patterns.
More specifically, in case we have a specified quantity $t$ of available memory
space (where $t < 2^w$, as obviously we cannot address more than $2^w$ words of
memory), we can precompute lookup tables using a standard technique known as
the four Russians technique [3] so that we can handle queries in time
$O(n \frac{\log d + \log\log_\sigma t + \log\log m}{\log_\sigma t} + occ)$. In
theorem [1] our algorithm reads the text in blocks of size $b = y$, where
$y$ is the length of the shortest pattern. We cannot afford to read more than
$y$ characters at each step, because by doing so we may miss an occurrence of
length $y$ lying inside the block. Thus in order to be able to choose a larger
block size $b$, we must be able to efficiently identify all substrings of any
block of (at most) $b$ characters which belong to $S$. The idea is then to use
tabulation to answer such queries in constant time (or rather in time linear in
the number of reported occurrences). In more detail, for each possible block of
$u \le b$ characters, we have a total of $(u-1)(u-2)/2$ substrings which could
begin at all but the first position of the block. For each possible block of
$u$ characters, we could store a list of all substrings belonging to $S$, and
each list takes at most $(u-1)(u-2)/2 = O(u^2)$ pointers of length $\log m$
bits. As we have a total of $\sigma^u$ possible blocks, we can use a
precomputed table of total size $t = O(\sigma^u u^2 \log m)$ bits.

Lemma 12 For a parameter $u \le \epsilon w / \log \sigma$ (where $\epsilon$ is
any constant such that $0 < \epsilon < 1$) and a set $S$ of patterns where each
pattern is of length at most $u$, we can build a data structure occupying
$O(\sigma^u \log^2 u \log m)$ bits of space such that given any string $T$ of
length $u$, we can report all the $occ$ occurrences of patterns of $S$ in $T$
in $O(occ)$ time.
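
The tabulation idea can be sketched as follows. This is a simplified
illustration rather than the space-efficient structure of the lemma: the table
is a plain Python dictionary, it records occurrences at every position of the
block (including the first one), and the name `build_block_table` is
hypothetical.

```python
from itertools import product


def build_block_table(patterns, alphabet, u):
    """Map every string of length u to the pattern occurrences it contains.

    Each entry is a list of (pattern id, start offset inside the block).
    """
    table = {}
    for chars in product(alphabet, repeat=u):
        block = "".join(chars)
        table[block] = [(pid, k) for pid, s in enumerate(patterns)
                        for k in range(u - len(s) + 1) if block.startswith(s, k)]
    return table
```

Once the table is built, answering a query for a block is a single lookup
followed by a scan of the stored list, hence time linear in the number of
reported occurrences. The table has $\sigma^u$ entries, which is why $u$ has to
stay small.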

Theorem [2] is obtained by combining lemmata [10], [11] and [12]. Suppose we
are given the parameter $s$. For implementing transitions, we can just use
lemma [10] in which we set $b = s$, where the transitions are built on the set
containing all the patterns. Now in order to report all the matching strings,
we build an instance of lemma [11] on the set $S$ in which we set $b = s$, and
also build $s - 1$ instances of lemma [12], one for every $u$ such that
$1 \le u < s$. More precisely, let $S_{\le u}$ be the subset of strings in $S$
of length at most $u$; then the instance number $u$ will be built on the set
$S_{\le u}$ using parameter $u$ and will thus, for all possible strings of
length $u$, store all matching patterns in $S$ of length at most $u$.

A query on a text $T$ works in the following way: we begin at step $I = 0$ and
the automaton is at state 0, which corresponds to the empty string.
Recognizing the patterns consists of the following actions done at each step
$I$:

1. Read the substring $T[i, j]$, where $i = Ib$ and $j = (I+1)b - 1$ (or
$j = n - 1$ if $n \le (I+1)b - 1$).

2. Recognize all the pattern occurrences which start at any position
$i' \le i$ and which terminate at any position $j' \in [i, j]$, using lemma
[11].

3. Recognize all the matching strings of lengths at most $b$ which are
substrings of $T[i+1, j]$, using the instance number $j - i$ of lemma [12].

4. Increment the step by setting $I = I + 1$. Then if $Ib \ge n$, stop the
algorithm immediately.

5. Do a transition using lemma [10] and return to action 1.

5.1 Analysis

Correctness The correctness of the transition is immediate. It can easily be
seen that at each step $I$, we are recognizing all occurrences which end at any
position in $[Ib, (I+1)b - 1]$ (or $[Ib, n - 1]$ in the last step). That is,
at any step $I$:

• Lemma [11] recognizes all occurrences which end at any position in
$[Ib, (I+1)b - 1]$ and start at any position $i' \le Ib$.

• Lemma [12] recognizes all patterns which end at any position in
$[Ib + 1, (I+1)b - 1]$ and start at any position $i'$ such that
$i' \in [Ib + 1, (I+1)b - 1]$, using instance number $s - 1$ of the lemma for
all but the last step.

• The last step of the algorithm recognizes all occurrences which end at any
position in $[Ib + 1, n - 1]$ and start at any position
$i' \in [Ib + 1, n - 1]$, using the instance number $u = n - Ib - 1$ of lemma
[12].

Thus at the last step, we will have recognized all the occurrences of patterns
in the text $T$.

Space usage It can easily be seen that the space used by lemma [10] is in
fact $O(m \log m)$. The space used by lemma [11] is
$O(ds \log(ds)) = O(m \log m)$ bits. The space used by all the instances of
lemma [12] is bounded above by $O(\sigma^s \log^2 s \log m)$. That is, for each
$u \in [1, s - 1]$, the instance number $u$ uses
$O(\sigma^u \log^2 u \log m) \le c(\sigma^u \log^2 u \log m)$ bits for some
constant $c$. Thus the total space usage is upper bounded by

$$c \sum_{u=1}^{s-1} \sigma^u \log^2 u \log m.$$

As we have $\sigma \ge 2$, the terms of this sum grow at least geometrically
with ratio 2, so the sum is at most twice its last term. The total space used
by lemma [12] can thus be upper bounded by
$2c(\sigma^{s-1} \log^2(s-1) \log m) = O(\sigma^s \log^2 s \log m)$. Summing up
the space used by the three lemmata we get
$O(m \log m + \sigma^s \log^2 s \log m)$ bits of space.

5.2 Consequences 

Corollary [3] derives easily from the theorem. That is, in the case $n \ge m$,
we can set $s = c \log_\sigma n$ for some constant $c < \epsilon/2$ for any
$\epsilon \in (0, 1)$. Then the space usage becomes
$O(m \log m + \sigma^s \log^2 s \log m) = O(m \log m + n^c (\log\log_\sigma n)^2 \log m) = O(m \log m + n^\epsilon)$.

Similarly corollary [4] derives immediately from the theorem. By setting
$s = c \log_\sigma m$ for some constant $c < 1$, the space usage becomes
$O(m \log m + \sigma^s \log^2 s \log m) = O(m \log m + m^c (\log\log_\sigma m)^2 \log m) = O(m \log m)$
bits of space.

6 Single string matching



We now turn to the proof of theorems [3] and [4]. In both theorems we only have
to match a single pattern $p$ of length $m$ against the text $T$ of length $n$.
We first describe the matching algorithm used in theorem [3], then sketch a
possible way to construct the data structure used in the matching algorithm. We
finally show the proof of theorem [4], which is based on the data structure of
theorem [3] combined with the use of the four Russians technique.

6.1 The matching algorithm 

Our string matching algorithm employs properties of periodic strings. For
implementing the string matching we use a sliding window of size $m + h$,
where $h = \lfloor m/3 \rfloor$, and at each step $i$, we shift the window by
$h + 1$ characters and spend time $O(m \frac{\log \sigma}{w} + 1)$. Thus the
total running time of the string matching will clearly be
$O(\frac{n}{h}(m \frac{\log \sigma}{w} + 1)) = O(n(\frac{\log \sigma}{w} + \frac{1}{m}))$.
We denote by $P$ the set of all factors of the string $p$ of length $m - h$.
Notice that $|P| \le h + 1$ (the same factor could possibly occur multiple
times in the string $p$). At any step $i$ we consider a text window
$W = T[i(h+1), (i+1)(h+1) + m]$ and find every occurrence of $p$ which starts
at any position in $W[0, h] = T[i(h+1), (i+1)(h+1) - 1]$. Now let
$q = W[h, m - 1]$ (note that $|q| = m - h = \lceil 2m/3 \rceil$). We will match
$q$ against all factors of $p$ of length $m - h$. Note that for a match to be
possible we must necessarily find that $q \in P$ (if $q$ is not a factor of $p$
then we cannot have a match). Every occurrence which begins at any position in
$W[0, h]$ must end at a position in $W[m - 1, h + m - 1]$. If the factor $q$
occurs a single time in the pattern $p$, then we just have to do a single
comparison, which takes optimal time. Thus from now on we concentrate on the
case where the factor $q$ occurs multiple times in the pattern $p$. Before
detailing the way the matching is done, we first prove the following lemma:

Lemma 13 If $q$ appears $i$ times in $W$ and $i'$ times in $p$, then $p$ occurs
at most $i - i' + 1$ times in $W$.

Proof. Let $q_1, q_2, \ldots, q_i$ be the sequence of consecutive appearances
of $q$ in $W$. Then any occurrence of $p$ in $W$ must span exactly a sequence
of $i'$ consecutive occurrences. As we have exactly $i - i' + 1$ sequences of
length $i'$ in a sequence of length $i$, we deduce that we have at most
$i - i' + 1$ occurrences of $p$ in $W$. $\square$
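
The bound is easy to check by brute force on small inputs. In the sketch below
the helper names `occurrences` and `lemma13_bound` are hypothetical,
occurrences may overlap, and the bound is clamped at 0 for the case where $q$
appears fewer times in $W$ than in $p$.

```python
def occurrences(needle, hay):
    """Start positions of all (possibly overlapping) occurrences of needle in hay."""
    return [k for k in range(len(hay) - len(needle) + 1)
            if hay.startswith(needle, k)]


def lemma13_bound(q, p, W):
    """Upper bound i - i' + 1 on the number of occurrences of p in W."""
    i, i_prime = len(occurrences(q, W)), len(occurrences(q, p))
    return max(0, i - i_prime + 1)
```

For example, with $p$ = `ababab`, $W$ = `abababab` and $q$ = `abab`, we get
$i = 3$ and $i' = 2$, so the bound is 2, and $p$ indeed occurs exactly twice in
$W$.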

Thus our first step of the matching will be to count the number of occurrences
of $q$ in $W$, which gives us an upper bound on the number of occurrences of
$p$ in $W$.

Counting occurrences of q in W First note that in case the factor $q$ appears
more than one time in $p$, then its (shortest) period is necessarily no longer
than $|q|/2$, where $|q| = m - h$, and $q$ can be uniquely decomposed as
$q = (uv)^t u$, where $t \ge 2$, $|v| \ge 1$ and $|uv|$ is the length of the
(shortest) period of $q$ (see footnote 4). This can easily be explained: as
$|q| \ge 2|p|/3$ (that is, we have
$|q| = m - h = \lceil 2m/3 \rceil \ge 2m/3$), necessarily any two occurrences
of $q$ in $p$ are separated by at most $h \le |p|/3 \le |q|/2$ characters. That
means that $q$ has a period of length at most $|q|/2$ and thus the (shortest)
period of $q$ is at most $|q|/2$. By a well known result (see Crochemore et
al.'s book [11, Lemma 1.6]) we deduce that all periods of $q$ of length at most
$|q|/2$ are multiples of the (shortest) period. We now describe the way the
number of occurrences of $q$ in $W$ is counted. Let $q' = W[0, h - 1]$,
$q'' = W[m, h + m - 1]$ (that is, $W = q'qq''$) and $g = |uv|$. For counting we
first do a longest suffix repetition search for $uv$ in $q'$ and then a longest
prefix repetition search for $vu$ in $q''$, returning two numbers $i'$ and
$i''$ respectively. We now deduce that we have exactly
$c_{q,W} = i' + i'' + 1$ occurrences of $q$ in $W$:

Lemma 14 The algorithm above correctly computes the number of occurrences 
of q in W. 

Proof. The string $W$ contains a substring $s = (uv)^{i'+t+i''} u$. This
substring contains at least $i' + i'' + 1$ occurrences of $q = (uv)^t u$. We
store a perfect hashing function so that with each factor $q$ we associate a
triplet $(\alpha_q, \beta_q, r_q)$, where $\alpha_q$ is a pointer to the first
occurrence of the factor, $\beta_q$ is the number of occurrences of $q$ and
$r_q$ is the period of $q$ (we emphasize that we need $r_q$ only in case
$\beta_q \ge 2$). What remains is to prove that $W$ contains no more than
$i' + i'' + 1$ occurrences of $q$. The proof is by contradiction. Suppose that
there is an occurrence $W[a, a + |q| - 1] = q$ which is outside of the
substring $s = (uv)^{i'+t+i''} u$. We have two cases:

• The occurrence $W[a, a + |q| - 1]$ is at the left of the occurrence
$W[h, m - 1]$, that is $a < h$. In this case notice that
$h - a \le h \le |q|/2$. We have $q = W[a, a + |q| - 1] = W[h, m - 1]$, meaning
that $q[t] = W[h + t] = W[a + (h - a) + t] = q[h - a + t]$ for any $t$ such
that $h - a + t < |q|$. This means that $h - a \le |q|/2$ is a period of $q$,
which moreover must be a multiple of the shortest period $|uv|$. From there we
deduce that $q[0, h - a - 1] = W[a, h - 1] = (uv)^k$ for some integer $k$,
which means that the string $W[a, h - 1]$ will be included in the substring
$(uv)^{i'}$ found by the longest suffix repetition search. Hence we conclude
that the occurrence $W[a, a + |q| - 1]$ is inside the substring
$s = (uv)^{i'+t+i''} u$, a contradiction.

• The occurrence $W[a, a + |q| - 1]$ is at the right of the occurrence
$W[h, m - 1]$. This is symmetric to the previous case and can be proved with a
similar argument.

$\square$
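
The counting procedure can be sketched as follows. The sketch assumes a window
of exactly $m + h$ characters with $h = \lfloor m/3 \rfloor$ and $m \ge 2$,
computes the shortest period of $q$ with the classical KMP failure function
(a choice made for this illustration only), and uses the hypothetical names
`shortest_period` and `count_q_in_window`.

```python
def shortest_period(q):
    """Length of the shortest period of q, via the KMP failure function."""
    fail = [0] * len(q)
    k = 0
    for i in range(1, len(q)):
        while k and q[i] != q[k]:
            k = fail[k - 1]
        if q[i] == q[k]:
            k += 1
        fail[i] = k
    return len(q) - fail[-1]


def count_q_in_window(W, m):
    """Occurrences of q = W[h:m] in a window W of length m + h, h = m // 3."""
    h = m // 3
    q = W[h:m]
    g = shortest_period(q)
    uv, vu = q[:g], q[len(q) - g:]      # first and last g characters of q
    left, pos = 0, h
    while pos - g >= 0 and W[pos - g:pos] == uv:       # copies of uv ending at h
        left, pos = left + 1, pos - g
    right, pos = 0, m
    while pos + g <= len(W) and W[pos:pos + g] == vu:  # copies of vu from m on
        right, pos = right + 1, pos + g
    return left + right + 1             # i' + i'' + 1
```

When $q$ has no period of length at most $h$, both loops stop immediately and
the function returns 1, which matches the fact that $q$ then occurs only once
in the window.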

Let $p$ be decomposed as $p = y(uv)^{t'}uz$, meaning that it contains
$c_{q,p} = t' - t + 1$ occurrences of $q$. The last step in the matching
consists in comparing $y$ against
$y' = q'[h - i'g - |y|, h - i'g - 1] = W[h - i'g - |y|, h - i'g - 1]$ and
comparing $z$ against
$z' = q''[i''g, i''g + |z| - 1] = W[i''g + m, i''g + m + |z| - 1]$. We now
distinguish four cases:

4 In this case $q$ (which has a period of length at most $|q|/2$) is said to be
periodic.

1. $y$ is a suffix of $uv$ but $z$ is not a prefix of $vu$. We require that
$y = y'$, $z = z'$ and moreover that $c_{q,W} \ge c_{q,p}$. If the requirement
is fulfilled then we have a single match of $p$ with the fragment
$W[i''g + |z|, i''g + |z| + m - 1]$. Otherwise we do not have any match.

2. $z$ is a prefix of $vu$ but $y$ is not a suffix of $uv$. This case is
symmetric to the previous one. We also require the same three conditions:
$y = y'$, $z = z'$ and $c_{q,W} \ge c_{q,p}$. If the requirement is fulfilled,
then we have a single match of $p$ with the fragment
$W[h - i'g - |y|, h - i'g - |y| + m - 1]$.

3. Neither $y$ is a suffix of $uv$ nor $z$ is a prefix of $vu$. In this case we
require $y = y'$, $z = z'$ and moreover that $c_{q,W} = c_{q,p}$. If the
requirement is fulfilled, we have a single match of $p$ with the fragment
$W[h - i'g - |y|, h - i'g - |y| + m - 1]$; otherwise we have no match.

4. $z$ is a prefix of $vu$ and $y$ is a suffix of $uv$. In this case we require
that $c_{q,W} \ge c_{q,p}$. Then in case both $y = y'$ and $z = z'$, we
conclude that we have $c_{q,W} - c_{q,p} + 1$ matches, where the first match is
$W[h - i'g - |y|, h - i'g - |y| + m - 1]$ and the last match is
$W[i''g + |z|, i''g + |z| + m - 1]$. Otherwise we have $c_{q,W} - c_{q,p}$
matches:

• In case $y = y'$ but $z \ne z'$, the first match is
$W[h - i'g - |y|, h - i'g - |y| + m - 1]$ and the last one is
$W[(i'' - 1)g + |z|, (i'' - 1)g + |z| + m - 1]$.

• In case $z = z'$ but $y \ne y'$, the first match is
$W[h - (i' - 1)g - |y|, h - (i' - 1)g - |y| + m - 1]$ and the last one is
$W[i''g + |z|, i''g + |z| + m - 1]$.

Note that any two consecutive occurrences are separated by $|uv|$ characters
and reporting all occurrences takes time linear in the number of occurrences.

Lemma 15 The algorithm above correctly computes the occurrences of p in 
W. 

Proof. By Lemma 13 we cannot have more than c_{q,W} - c_{q,p} + 1 occurrences of p
in W. Thus to have at least one match we require that c_{q,W} ≥ c_{q,p}. Now consider
the starting position a of an occurrence of p in W. The only possible values of
a are of the form h - i'g - |y|, h - i'g - |y| + |uv|, ..., h - i'g - |y| + |uv|(c_{q,W} - c_{q,p}). If
c_{q,W} = c_{q,p} we conclude that we have a single possible match and this match is
handled by case 1, where the match is verified by matching the substrings
z and y against the substrings z' and y'. Now in case c_{q,W} - c_{q,p} > 0, we could
have more than one match. This case is handled by cases 2, 3 and 4 in the
algorithm above. We now prove that those three cases work correctly: we divide
the set of matches into three categories, the leftmost match, the rightmost match
and the c_{q,W} - c_{q,p} - 1 middle matches. It can easily be verified that the middle
matches are only possible in case z is a prefix of vu and y is a suffix of uv. It
can also easily be verified that the leftmost match is only possible if y = y' and
z is a prefix of vu. Likewise the rightmost match is only possible if z = z' and
y is a suffix of uv. It can easily be checked that the three last cases correctly
account for the matches. □

Implementation We now analyze in detail the data structures needed for the
matching. The first step of the matching is to match the string q = W[h, m - 1]
against all factors of p of length m - h. Thus what we need is to build a
dictionary on the set of factors of p of length m - h. In addition, in case q
occurs as a factor in p, we must proceed to a second step. In this second step we
need to determine the factors u, v, y, z of p. Thus the required dictionary must
fulfill the following needs: take only O(m log m) bits of space, answer in optimal
O(m log σ / w) time and return the necessary information to deduce the factors
u, v, y, z of p. For our implementation we can use a very simple data structure:
we store a perfect hash function which associates with each factor q a triplet
(α_q, β_q, r_q), where α_q is a pointer to the first occurrence of q in p, β_q is the number of
occurrences of q in p and r_q is the period of q (note that we need r_q only in case
β_q ≥ 2). For implementing the first step we can just compare q with the factor
p[α_q, α_q + m - h - 1]. For implementing the second step, the factors u, v, y, z of q
are also factors of p and their positions in p are easily deduced as combinations
of the parameters α_q, β_q and r_q.
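
The first step can be illustrated with a small sketch. It assumes a toy hash function (`toy_hash`, a hypothetical stand-in for the perfect hash function, chosen only because it happens to be collision-free on the two factors of the example) and a slot table holding the triplets (first occurrence, number of occurrences, period). The point of the sketch is the verification against p: a string which is not a factor of p may land in an occupied slot, so the candidate must be compared with the factor stored at the first occurrence.

```python
def toy_hash(s, size):
    """Hypothetical stand-in for the perfect hash function (not from the paper)."""
    return sum(ord(c) * (i + 1) for i, c in enumerate(s)) % size


def lookup_factor(p, slots, q):
    """Return the triplet (first, count, period) of q, or None if q is not a factor of p."""
    entry = slots[toy_hash(q, len(slots))]
    if entry is None:
        return None
    first, count, period = entry
    # A non-member may hash to an occupied slot, so compare q with the stored factor.
    if p[first:first + len(q)] != q:
        return None
    return entry


# Example: factors of length 3 of p = "abababa" are "aba" and "bab".
p = "abababa"
slots = [None] * 3
slots[toy_hash("aba", 3)] = (0, 3, 2)   # occurs at 0, 2, 4
slots[toy_hash("bab", 3)] = (1, 2, 2)   # occurs at 1, 3
```

In this example the string "abb" hashes to the same slot as "aba" and is rejected only by the comparison with p[0, 2].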

Running time analysis We now analyze more precisely the running time
of the matching. We can prove that the total running time of the matching on
the window W is O(m log σ / w + occ), where occ is the number of reported
occurrences. First, notice that in any case we are doing a constant number of
operations among the following ones:

• Matching q against the set P. 

• Longest suffix repetition search for uv in q' . 

• Longest prefix repetition search of vu in q". 

• Comparing z with z' and y with y'. 

Now the matching of q can be done in optimal O(|q| log σ / w) = O(m log σ / w) time
by means of perfect hashing. The longest suffix repetition matching of uv in
q' takes O(|q'| log σ / w) = O(m log σ / w) time. Likewise doing a longest prefix repetition
matching of vu in q'' takes O(m log σ / w) time. Finally the comparisons of z with z' and of
y with y' take O(m log σ / w) time as the compared strings are of length O(m). Thus we
have proved the following lemma:

Lemma 16 The matching of all occurrences of a string p of length m into a
string W of length m + h where h < ⌊m/3⌋ can be done in optimal O(m log σ / w + occ)
time, where occ is the number of occurrences.
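
The O(m log σ / w) bounds above rest on comparing strings one machine word at a time. The following sketch illustrates the counting only: the helper names, the default w = 64 and the explicit alphabet argument are assumptions made for the illustration, and the packing loop itself is linear in Python, whereas in the RAM model the strings are assumed to be stored packed already.

```python
def pack(text, alphabet, w=64):
    """Pack text into w-bit integers, floor(w / bits) characters per word."""
    bits = max(1, (len(alphabet) - 1).bit_length())   # ceil(log2 sigma)
    per_word = w // bits
    code = {c: i for i, c in enumerate(alphabet)}
    words = []
    for start in range(0, len(text), per_word):
        word = 0
        for c in text[start:start + per_word]:
            word = (word << bits) | code[c]
        words.append(word)
    return words


def packed_equal(x, y, alphabet, w=64):
    """Compare x and y word by word; return (equal, number of word comparisons)."""
    if len(x) != len(y):
        return False, 0
    comparisons = 0
    for a, b in zip(pack(x, alphabet, w), pack(y, alphabet, w)):
        comparisons += 1
        if a != b:
            return False, comparisons
    return True, comparisons
```

With a four-letter alphabet and w = 64, each word holds 32 characters, so two strings of length 80 are compared using 3 word comparisons instead of 80 character comparisons.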

6.2 Construction of the data structure 

The dictionary described above can easily be constructed in O(m) time.

Determining the factors For determining the triplet (α_q, β_q, r_q) associated
with each factor q, we begin by building the suffix tree of p. Then we do a DFS
traversal of the suffix tree in which the traversal is restricted to nodes of
depth at most m - h. More precisely, during the DFS traversal, each
time we reach a node n_q whose path is of length at least m - h and is
prefixed by the factor q, we know that the nodes in the subtree of n_q
have pointers to all occurrences of q. Then it suffices to count the size of the
subtree of n_q to get β_q. For determining α_q and r_q, it suffices to traverse the
pointers and select the two smallest ones α_q and γ_q. Then if β_q ≥ 2 we will
have that r_q = γ_q - α_q. That is, in case q occurs at least two times, the period
of q is just the difference between the pointers to the first and second occurrences
of q. In order to record which factors are equal, we use a temporary table
ID[1..n] which is set during the traversal of the suffix tree as well as a table
T of triplets. More precisely we attribute consecutive identifiers starting from
zero to the different factors during the traversal of the suffix tree: when we reach a
factor q we give it identifier i and for each occurrence of q at position j we set
ID[j] = i and also set T[j] = (α_q, β_q, r_q).
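
The effect of the traversal can be reproduced with a much simpler (and slower) sketch. Instead of a suffix tree, it sorts the start positions by their factor, which groups equal factors together in the same way as the subtrees of the nodes n_q; this substitution, the function name and the 0-based tables are assumptions of the illustration and it does not run in O(m) time.

```python
def build_id_and_triplets(p, length):
    """Return tables (ident, triplets) indexed by the start position of each factor."""
    n = len(p) - length + 1                      # number of factor start positions
    # Sorting by factor stands in for the DFS over the truncated suffix tree.
    starts = sorted(range(n), key=lambda j: p[j:j + length])
    ident = [None] * n
    triplets = [None] * n
    next_id = 0
    i = 0
    while i < n:
        factor = p[starts[i]:starts[i] + length]
        k = i
        while k < n and p[starts[k]:starts[k] + length] == factor:
            k += 1
        group = sorted(starts[i:k])              # all occurrences of this factor
        first, count = group[0], len(group)
        # The period is the gap between the two smallest occurrence positions.
        period = group[1] - group[0] if count >= 2 else None
        for j in group:
            ident[j] = next_id
            triplets[j] = (first, count, period)
        next_id += 1
        i = k
    return ident, triplets
```

For p = "abababa" and factors of length 3, the positions 0, 2, 4 receive identifier 0 with triplet (0, 3, 2) and the positions 1, 3 receive identifier 1 with triplet (1, 2, 2).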

Construction of the perfect hash function We now turn our attention to
the creation of the perfect hash function. The construction follows a standard
procedure: first the set P is mapped injectively into a set of integers
using a string hash function which transforms each string in P into an integer
of O(log n) bits, and then the perfect hash function is computed on
that set. For implementing the first step we do B = ⌊w/log σ⌋ traversals of
the pattern p. At each traversal i we compute the hash values for the
patterns which start at positions i, i + w/log σ, i + 2w/log σ, ... At each traversal
we compute the factor hash values in the following way: we conceptually divide
each factor into blocks of size B, except the last block, which is potentially of size less than
B and is padded with zero characters. Then we can compute the hash values
using a polynomial hash function [13, Section 5] H_γ which is parametrized with
a sufficiently large prime γ. That is, we just use a table H in which we initialize
H[0] = 0 and then successively for increasing j set H[j + 1] = H[j] ⊗ γ ⊕ p[i + jB, i + (j +
1)B - 1] (except at the last step where we set H[j + 1] = H[j] ⊗ γ ⊕ p[i + jB, m - 1]).
Then the hash value associated with the factor occurring at position i + jB
will be given by H[j + m/B] - H[j] ⊗ γ^{m/B}. Then before doing the second step,
we check whether the computed hash values for different factors (two factors
at positions i_0 and i_1 are considered distinct if their corresponding triplets T[i_0] and
T[i_1] differ) are all distinct (that is, whether the hash function H_γ is injective on the set P)
and if it is not the case, we redo the computation of the hash values using a newly
chosen hash function H_γ parametrized by a new randomly chosen prime γ. This
procedure is repeated until we get a set of distinct integers, in which case we can
proceed with the second step of the computation, which consists in building the
perfect hash function on the set of integers obtained in the first step.
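
The retry loop of the first step can be sketched as follows. The sketch makes several simplifying assumptions which may differ from the construction above: it hashes character by character rather than in blocks of B characters, it treats the randomly chosen prime as the modulus of a Karp-Rabin style fingerprint with a fixed base of 256, it compares the factors directly instead of their triplets, and all function names are hypothetical.

```python
import random


def is_prime(n):
    """Trial division, sufficient for the small primes used in this sketch."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def random_prime(lo, hi, rng):
    while True:
        candidate = rng.randrange(lo, hi)
        if is_prime(candidate):
            return candidate


def factor_hashes(p, length, base, prime):
    """Hash of every factor of p of the given length, derived from prefix hashes."""
    prefix = [0]
    for c in p:
        prefix.append((prefix[-1] * base + ord(c)) % prime)
    top = pow(base, length, prime)
    return [(prefix[j + length] - prefix[j] * top) % prime
            for j in range(len(p) - length + 1)]


def injective_hashes(p, length, seed=0):
    """Redraw the prime until distinct factors receive distinct hash values."""
    rng = random.Random(seed)
    while True:
        prime = random_prime(10**5, 10**6, rng)
        hashes = factor_hashes(p, length, 256, prime)
        seen = {}
        if all(seen.setdefault(h, p[j:j + length]) == p[j:j + length]
               for j, h in enumerate(hashes)):
            return prime, hashes
```

Equal factors always receive equal hash values, since the value depends only on the content of the factor, so the check only needs to reject a prime under which two different factors collide.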

6.3 Tabulation based solution



We now give a proof of theorem 2. In the case m > a/2, theorem 3 already
gives the required query time using just O(m log m) bits of space. Thus we
will focus on the case m < a/2. For this case there is a very simple solution:
consider that at each step i (where i is initialized as zero) we match the substring
s = T[ia/2, i(a/2) + a - 1] against all occurrences of the pattern p (consider
w.l.o.g. that a is an even number) which start anywhere inside s[0, a/2 - 1] =
T[ia/2, (i + 1)(a/2) - 1]. Using the four russian technique we can store, for every
possible string s of length a, all positions of all occurrences of p in s (if any)
which start at any position in s[0, a/2 - 1]. Note that every such occurrence
must end at some position in s[a/2 - 1, a - 1]. As we have σ^a strings of length
a and we can have at most a/2 positions of occurrences, the space used by the
table which stores all those occurrences for every possible string is thus just
O(σ^a log a) bits.
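
The tabulation can be illustrated on a tiny alphabet. In this sketch the table is a Python dictionary keyed by the window string rather than a packed array indexed by the machine word holding the window, the text is padded so that the last window has length a, and matches reaching into the padding are discarded; these choices and the function names are assumptions of the illustration.

```python
from itertools import product


def build_table(p, alphabet, a):
    """For every string s of length a, list the starts of p within s[0, a/2 - 1]."""
    half = a // 2
    table = {}
    for chars in product(alphabet, repeat=a):
        s = "".join(chars)
        table[s] = [j for j in range(half) if s.startswith(p, j)]
    return table


def search(text, p, alphabet, a):
    """Report all occurrences of p in text using one table lookup per a/2 characters."""
    if a % 2 != 0 or len(p) > a // 2:
        raise ValueError("a must be even and the pattern must fit in half a window")
    half = a // 2
    table = build_table(p, alphabet, a)
    padded = text + alphabet[0] * a          # so that every window has length a
    result = []
    for start in range(0, len(text), half):
        for j in table[padded[start:start + a]]:
            if start + j + len(p) <= len(text):   # drop matches reaching the padding
                result.append(start + j)
    return result
```

For example, with the alphabet "ab", a = 4 and the pattern "ab", the table has 16 entries and the text "abaababaab" is scanned with 5 lookups.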

7 Conclusion 

In this paper, we have proposed four solutions to the problems of single and 
multiple pattern matching on strings in the RAM model. In this model we 
assume that we can read O(w/log σ) consecutive characters of any string in
O(1) time. The first and third solutions have a query time which depends on
the length of the shortest pattern (for multiple string matching) and the length 
of the only pattern respectively, in a way similar to that of the previous algo- 
rithms which aimed at average-optimal expected performance (not worst-case 
performance as in our case). Those two solutions achieve optimal query times 
if the shortest pattern (or the only pattern in the third solution) is sufficiently 
long. The second and fourth solutions have no dependence on the length of 
the shortest pattern but need to use additional precomputed space. They are 
interesting alternatives to the previous tabulation approaches by Bille [5] and 
Fredriksson [XT] . 

This paper gives rise to two interesting open problems: 

• In order to obtain any speedup we either rely on the length of the shortest 
pattern being long enough (theorem Q] and [3]) or have to use additional 
precomputed space (theorems [5] and 2] ) . An important open question is 
whether it is possible to obtain any speedup without relying on any of the 
two assumptions. 

• The space usage of both solutions is Ω(m log m) bits, but the patterns
themselves occupy just O(m log σ) bits of space. The space used is thus at
least a factor Ω(log_σ m) larger than the space occupied by the patterns. An
interesting open problem is whether it is possible to obtain an acceleration
compared to the standard AC automaton while using only O(m log σ) bits
of space.